Normative Addendum 1 embodies C's reaction to both the limitations and promises of international character sets. Digraphs and the <iso646.h> header were meant to improve the appearance of C programs written in national variants of ISO 646 without, e.g., { or } characters. On the other end of the spectrum, the facilities connected to <wchar.h> and <wctype.h> extend the old Standard's barely adequate basis into a complete and consistent set of utilities for handling wide characters and multibyte strings.
This document summarizes Normative Addendum 1. It is intended to quickly inform readers who are already familiar with the Standard; it does not, and cannot, introduce the complex subject matter behind NA1, nor can it replace the original document as a reference manual. (Nevertheless, it tries to be as accurate as possible, and its author would like to hear about any errors or omissions.)
__STDC_VERSION__ shall expand to 199409L.
(The Normative Addendum was formally registered with ISO in September 1994.)
<: :> <% %> %: %:%:
These tokens behave identically to the tokens and preprocessing tokens:
[ ] { } # ##
respectively (except that they are spelled differently, and so stringize differently).
%: and %:%:.
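As a small illustration of the digraphs, the following sketch writes a function body and an array using only the alternative spellings, and checks that stringizing keeps the spelling as written. The names SPELLING, digraph_sum and digraph_keeps_spelling are made up for this example.

```c
#include <assert.h>
#include <string.h>

/* Stringizing helper: the operand is turned into a string exactly as spelled. */
#define SPELLING(tok) #tok

/* Sums a small array using only the digraph spellings of [ ] { }. */
int digraph_sum(void)
<%
    int values<:3:> = <% 1, 2, 3 %>;

    assert(sizeof values / sizeof values<:0:> == 3);
    return values<:0:> + values<:1:> + values<:2:>;
%>

/* Nonzero if the digraph <: stringizes as "<:" rather than as "[". */
int digraph_keeps_spelling(void)
<%
    return strcmp(SPELLING(<:), "<:") == 0 && strcmp(SPELLING(<:), "[") != 0;
%>
```

Apart from the stringized spelling, a compiler that implements the Addendum should treat this source exactly as if the ordinary tokens had been written.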
#define and &&
#define and_eq &=
#define bitand &
#define bitor |
#define compl ~
#define not !
#define not_eq !=
#define or ||
#define or_eq |=
#define xor ^
#define xor_eq ^=
These macro names are reserved for all purposes in translation units that include the header, but are not reserved in those that do not (this is the same as for any other Standard macros).
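A short sketch of how the macros read in practice; in_range and clear_bits are hypothetical helpers, not part of any header.

```c
#include <assert.h>
#include <iso646.h>

/* Nonzero if lo <= x <= hi, written with the <iso646.h> spellings. */
int in_range(int x, int lo, int hi)
{
    assert(lo <= hi);
    return x >= lo and x <= hi;
}

/* Clears the bits of mask in flags, i.e. flags & ~mask. */
unsigned clear_bits(unsigned flags, unsigned mask)
{
    return flags bitand compl mask;
}
```

Because these are ordinary macros, the names remain usable as identifiers in any translation unit that does not include the header.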
wchar_t.
Not all code values have to represent a character; those that do not must not appear in wide strings that are converted to multibyte characters. Code value 0 is reserved for the ``end of string'' indicator.
char).
A character can have representations in more than one state, and can have more than one representation in any given state. The representation in different states can differ. Not all byte sequences are necessarily valid; an invalid sequence causes an encoding error when interpreted (normally shown by setting errno to EILSEQ).
However, for encodings used by other library functions, there are further restrictions:
fwprintf); all these identifiers are declared by <wctype.h> or <wchar.h>. These identifiers are reserved with external linkage in all the translation units of a program if and only if any translation unit includes either of those headers (thus changes in one translation unit may cause another translation unit to invoke undefined behavior).
EILSEQ is added to the list of error conditions (currently this list consists of EDOM and ERANGE).
typedef ... wint_t;
WEOF (described below). It can be the same type as wchar_t.
typedef ... wctrans_t;
typedef ... wctype_t;
wctype_t represents a classification of characters (like ``is lower case'' or ``is accented''), while wctrans_t represents a character conversion (like ``change to upper case'' or ``remove any accent'').
wint_t.
It need not be negative nor equal to EOF, but it serves the same purpose: the value, which must not be a valid wide character, is used to represent an end of file or as an error indication.
LC_CTYPE category of the current locale.
int iswalnum (wint_t);
int iswalpha (wint_t);
int iswcntrl (wint_t);
int iswdigit (wint_t);
int iswgraph (wint_t);
int iswlower (wint_t);
int iswprint (wint_t);
int iswpunct (wint_t);
int iswspace (wint_t);
int iswupper (wint_t);
int iswxdigit (wint_t);
WEOF or representable as a wchar_t. The function will return nonzero if and only if the argument is a wide character of the appropriate type.
The types are the same as for the <ctype.h> functions, except that iswgraph and iswpunct are guaranteed to return false not only for space (as their char counterparts do), but for any character that iswspace() considers white space. Thus a locale could make isgraph('\t') true, but iswgraph(L'\t') is always false; in the "C" locale both are false.
For the remaining nine functions the expression
(!isXXXXX(wctob(wc)) || iswXXXXX(wc))
is true for every wide character. That is, for any wide character which has a corresponding singlebyte character (which is what wctob returns), if the latter has the given property, then so does the former. Note that this is not a symmetric relationship.
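The implication can be spot-checked for one property. The sketch below does so for "alpha" over every single-byte value in the current locale; alpha_implication_holds and count_alpha_violations are hypothetical helper names, and the check says nothing about locales other than the one in effect when it runs.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <wchar.h>
#include <wctype.h>

/* Evaluates (!isalpha(wctob(wc)) || iswalpha(wc)) for the wide
   equivalent of the single-byte character c. Bytes without a wide
   equivalent have nothing to check and count as holding. */
int alpha_implication_holds(int c)
{
    wint_t wc;

    assert(c >= 0 && c <= UCHAR_MAX);   /* btowc expects an unsigned char value */
    wc = btowc(c);
    if (wc == WEOF)
        return 1;
    return !isalpha(wctob(wc)) || iswalpha(wc);
}

/* Number of single-byte characters that break the implication. */
int count_alpha_violations(void)
{
    int c, violations = 0;

    for (c = 0; c <= UCHAR_MAX; c++)
        if (!alpha_implication_holds(c))
            violations++;
    return violations;
}
```

A conforming implementation should report no violations. The reverse direction is deliberately not tested, since the relationship is not symmetric.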
wctype_t wctype (const char *);
int iswctype (wint_t, wctype_t);
isXXXXX or iswXXXXX functions to test for other properties (e.g. ``is a katakana character''), it was felt that this cluttered the namespace (though the names are all reserved) without being flexible enough for future needs. Instead, the committee introduced a mechanism that can be extended at run-time.
wctype() names a category to test for; wctype() returns a wctype_t magic cookie that can be handed to iswctype to test for the named category, or zero if it does not recognize the category. The eleven builtin categories "alnum", "alpha", ... "xdigit" must be recognized by all implementations. Thus, iswctype(ch, wctype("punct")) is the same as iswpunct(ch). The wctype_t value is only valid for the LC_CTYPE category used to create it.
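A minimal sketch of the lookup-then-test pattern described above. The wrapper in_named_class and its three-way return convention are assumptions made for this example.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Returns 1 if wc belongs to the named category, 0 if it does not,
   and -1 if the current locale does not recognize the category name. */
int in_named_class(wint_t wc, const char *name)
{
    wctype_t desc;

    assert(name != NULL);
    desc = wctype(name);            /* look the category up by name */
    if (desc == (wctype_t)0)
        return -1;                  /* unknown category */
    return iswctype(wc, desc) != 0;
}
```

Since the wctype_t value is tied to the LC_CTYPE category in effect when it was obtained, this sketch looks it up on every call instead of caching it across a possible setlocale call.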
wint_t towlower (wint_t);
wint_t towupper (wint_t);
toupper and tolower. toupper('é') == 'E', towupper(L'é') == L'É'
wctrans_t wctrans (const char *);
wint_t towctrans (wint_t, wctrans_t);
wctype() and iswctype() provide extensible tests.
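The mapping functions follow the same lookup-then-apply pattern. The sketch below relies on the two mapping names every implementation must accept, "toupper" and "tolower"; the wrapper map_named, and its choice to return the character unchanged for an unknown name, are assumptions of this example.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Applies the named mapping to wc. If the current locale does not
   recognize the name, wc is returned unchanged. */
wint_t map_named(wint_t wc, const char *name)
{
    wctrans_t desc;

    assert(name != NULL);
    desc = wctrans(name);           /* look the mapping up by name */
    if (desc == (wctrans_t)0)
        return wc;                  /* unknown mapping */
    return towctrans(wc, desc);
}
```

With the built-in names, map_named(wc, "toupper") is expected to agree with towupper(wc).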
struct tm;
typedef ... size_t;
typedef ... wchar_t;
typedef ... wint_t;
#define NULL ...
#define WEOF ...
struct tm; it is still necessary to include <time.h> before defining a variable of this type.
typedef ... mbstate_t;
WCHAR_MAX and WCHAR_MIN
wchar_t can hold. They are integral constant expressions of type wchar_t, but not necessarily valid as wide characters. For example, if wchar_t is a typedef for unsigned short, then WCHAR_MIN will be zero and WCHAR_MAX will be the same as USHRT_MAX.
mbstate_t, and an orientation; it can be byte-oriented, wide-oriented, or unoriented. When a stream is opened (including stdin etc., and calls to freopen), it is unoriented. The functions ungetc, fgetc, fputc, and those defined to work through them, change an unoriented stream to byte-oriented, and shall not be called on a wide-oriented stream. The functions ungetwc, fgetwc, fputwc, and those defined to work through them, change an unoriented stream to wide-oriented, and shall not be called on a byte-oriented stream.
Wide binary streams shall obey the positioning restrictions of both text and binary streams. Positioning a wide-oriented stream in the middle of an existing character representation and then writing makes all following contents undefined.
The mbstate_t object associated with a stream is saved by fgetpos and restored by fsetpos. The object is initialized when the stream is opened as if it were an object declared with static lifetime (i.e. all zeroes and null pointers).
The *scanf and *printf functions have the ability to handle strings of the opposite type to the majority (that is, wide strings in fprintf etc. and multibyte strings in fwprintf etc.). These strings are converted to the majority form before (for *printf) or after (for *scanf) any other processing. This conversion is done as if using calls to mbrtowc or wcrtomb, but with an mbstate_t object set to the initial state before each such conversion.
wint_t fgetwc (FILE *);
mbrtowc (using the stream's mbstate_t object) until a complete wide character has been read, or an error occurs. The character or WEOF is returned; the latter can indicate end of file (the eof indicator is set), a read error (the error indicator is set), or a conversion error (errno is set to EILSEQ). All other wide character input is done as if via fgetwc.
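The three meanings of WEOF can be told apart after the fact. The following is a sketch of one way to do it; the function count_wide_chars and its negative return codes are assumptions of this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <wchar.h>

/* Reads f to the end with fgetwc. Returns the number of wide characters
   read, -1 after a read error, or -2 after a conversion error. */
long count_wide_chars(FILE *f)
{
    long n = 0;
    wint_t wc;

    assert(f != NULL);
    errno = 0;
    while ((wc = fgetwc(f)) != WEOF)
        n++;
    if (feof(f))
        return n;                   /* plain end of file */
    if (errno == EILSEQ)
        return -2;                  /* invalid multibyte sequence in the input */
    return -1;                      /* read error */
}
```

errno is tested before falling back to "read error" because an implementation may also set the error indicator when it reports a conversion error.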
wint_t fputwc (wchar_t, FILE *);
wcrtomb (using the stream's mbstate_t object) and writes the resulting bytes to the stream. The character or WEOF is returned; the latter can indicate a write error (the error indicator is set) or a conversion error (errno is set to EILSEQ). All other wide character output is done as if via fputwc.
fprintf (and printf and sprintf):
%lc, which requires a wint_t argument, and %ls, which requires a wchar_t * argument.
%lc is equivalent to %ls called with a two element array (the argument in the first element, and zero in the second).
%ls converts the wide characters to bytes; the precision indicates the maximum number of bytes written (conversion will also stop on a zero wide character); a partial multibyte character will not be output, though complete trailing shift sequences might be.
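As an illustration of %ls with a precision, the sketch below formats a wide string into a byte buffer, limiting the converted part to a given number of bytes. The helper format_wide is hypothetical; the caller is assumed to supply a buffer of at least max_bytes + 3 bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Writes "[", at most max_bytes bytes of the multibyte form of ws, and
   "]" into buf. Returns whatever sprintf returns: the number of bytes
   written, or a negative value on failure. */
int format_wide(char *buf, int max_bytes, const wchar_t *ws)
{
    assert(buf != NULL && ws != NULL && max_bytes >= 0);
    return sprintf(buf, "[%.*ls]", max_bytes, ws);
}
```

In an encoding with multibyte characters, the precision may cut the output short of max_bytes, because a partial multibyte character is never written.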
fscanf (and scanf and sscanf):
%lc, %ls, and %l[; all take a pointer to wchar_t, and convert the input from its multibyte representation to wide characters after matching. (The qualified and unqualified conversions match the same input.)
int fwprintf (FILE *, const wchar_t *, ...);
int wprintf (const wchar_t *, ...);
int swprintf (wchar_t *, size_t, const wchar_t *, ...);
int vfwprintf (FILE *, const wchar_t *, va_list);
int vwprintf (const wchar_t *, va_list);
int vswprintf (wchar_t *, size_t, const wchar_t *, va_list);
fprintf, including the extensions above. With %c, the character is converted using btowc; with %s, the string is converted to wide characters before output. With all formats, width and precision are measured in wide characters. The second argument of swprintf is the number of elements of the destination array (including the terminating zero which is always written).
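A sketch of swprintf with a multibyte %s argument; format_pair is a hypothetical helper. Note that n counts array elements (wide characters), not bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Formats "key=value" as a wide string into dst, which has room for n
   wide characters including the terminating zero. Returns the number of
   wide characters written (excluding the terminator), or a negative
   value if the result does not fit or cannot be converted. */
int format_pair(wchar_t *dst, size_t n, const char *key, int value)
{
    assert(dst != NULL && key != NULL && n > 0);
    return swprintf(dst, n, L"%s=%d", key, value);
}
```

Unlike sprintf, the size argument lets the function refuse to overrun the destination, so the return value is worth checking.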
int fwscanf (FILE *, const wchar_t *, ...);
int wscanf (const wchar_t *, ...);
int swscanf (const wchar_t *, const wchar_t *, ...);
fscanf, including the extensions above. With %c, %s, and %[, the accepted input field will be converted to its multibyte equivalent after being matched. With all formats, width and precision are measured in wide characters.
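The sketch below reads one field as wide characters (%ls) and one as a multibyte string (%s) from the same wide input. parse_line, WORD_CHARS and WORD_BYTES are names invented for the example; the byte buffer is sized on the assumption that each matched wide character may need up to MB_LEN_MAX bytes.

```c
#include <assert.h>
#include <limits.h>
#include <wchar.h>

#define WORD_CHARS 15                               /* field width, in wide characters */
#define WORD_BYTES (WORD_CHARS * MB_LEN_MAX + 1)    /* room for the multibyte copy */

/* Parses "<int> <word> <word>" from a wide string. The first word is
   stored as wide characters, the second is converted to multibyte form.
   Returns the number of items assigned, as swscanf does. */
int parse_line(const wchar_t *line, int *num,
               wchar_t word[WORD_CHARS + 1], char mbword[WORD_BYTES])
{
    assert(line != NULL && num != NULL && word != NULL && mbword != NULL);
    return swscanf(line, L"%d %15ls %15s", num, word, mbword);
}
```

The width of 15 in both conversions counts wide characters of input, which is why the char buffer is larger than the wchar_t one.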
wchar_t *fgetws (wchar_t *, int, FILE *);
int fputws (const wchar_t *, FILE *);
wint_t getwc (FILE *);
wint_t getwchar (void);
wint_t putwc (wchar_t, FILE *);
wint_t putwchar (wchar_t);
wint_t ungetwc (wint_t, FILE *);
getwc and putwc's FILE * argument.)
int fwide (FILE *, int);
double wcstod (const wchar_t *, wchar_t **);
long int wcstol (const wchar_t *, wchar_t **, int);
unsigned long int wcstoul (const wchar_t *, wchar_t **, int);
wchar_t *wcscpy (wchar_t *, const wchar_t *);
wchar_t *wcsncpy (wchar_t *, const wchar_t *, size_t);
wchar_t *wcscat (wchar_t *, const wchar_t *);
wchar_t *wcsncat (wchar_t *, const wchar_t *, size_t);
int wcscmp (const wchar_t *, const wchar_t *);
int wcscoll (const wchar_t *, const wchar_t *);
int wcsncmp (const wchar_t *, const wchar_t *, size_t);
size_t wcsxfrm (wchar_t *, const wchar_t *, size_t);
wchar_t *wcschr (const wchar_t *, wchar_t);
size_t wcscspn (const wchar_t *, const wchar_t *);
wchar_t *wcspbrk (const wchar_t *, const wchar_t *);
wchar_t *wcsrchr (const wchar_t *, wchar_t);
size_t wcsspn (const wchar_t *, const wchar_t *);
wchar_t *wcsstr (const wchar_t *, const wchar_t *);
size_t wcslen (const wchar_t *);
wchar_t *wmemchr (const wchar_t *, wchar_t, size_t);
int wmemcmp (const wchar_t *, const wchar_t *, size_t);
wchar_t *wmemcpy (wchar_t *, const wchar_t *, size_t);
wchar_t *wmemmove (wchar_t *, const wchar_t *, size_t);
wchar_t *wmemset (wchar_t *, wchar_t, size_t);
size_t wcsftime (wchar_t *, size_t, const wchar_t *, const struct tm *);
wchar_t *wcstok (wchar_t *, const wchar_t *, wchar_t **);
strtok, but uses the object pointed to by the third argument to keep state, rather than keeping it internally as strtok does. This change makes it possible to interleave calls to wcstok over different input strings.
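The interleaving can be sketched as follows: two strings are tokenized in alternation, each with its own state pointer. count_interleaved is a hypothetical helper; like strtok, wcstok modifies the strings it scans, so both must be writable arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Tokenizes a and b alternately, one token from each in turn, and
   returns the total number of tokens found in both strings. */
size_t count_interleaved(wchar_t *a, wchar_t *b, const wchar_t *delims)
{
    wchar_t *state_a, *state_b;     /* one saved position per string */
    wchar_t *tok_a, *tok_b;
    size_t n = 0;

    assert(a != NULL && b != NULL && delims != NULL);
    tok_a = wcstok(a, delims, &state_a);
    tok_b = wcstok(b, delims, &state_b);
    while (tok_a != NULL || tok_b != NULL) {
        if (tok_a != NULL) {
            n++;
            tok_a = wcstok(NULL, delims, &state_a);
        }
        if (tok_b != NULL) {
            n++;
            tok_b = wcstok(NULL, delims, &state_b);
        }
    }
    return n;
}
```

The same loop written with strtok would lose its place in the first string as soon as the second one was started.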
mbstate_t object that they keep their conversion state in. Such an object can be set to all zeroes (e.g. by assigning to it the value of an mbstate_t object with static lifetime which has not been explicitly initialized) and is then in its initial state. When an object is in the initial state (no matter how this occurred), it is prepared for conversion in either direction (from multibyte to wide characters or vice versa) starting in the initial state. Once an object has left its initial state (which happens whenever it is used with one of the following functions unless the description says otherwise), it shall only be used in the same LC_CTYPE category [*] and same direction as the previous call, and shall not be used after a conversion error. If a null pointer is passed, each function uses its own internal object which is initialized to all zeroes at program startup.
mbstate_t object associated with a stream is bound to an encoding by the first fgetwc or fputwc call after the stream is opened, and can then be used with any locale.
wint_t btowc (int);
unsigned char) to the corresponding wide character, if any, or else returns WEOF.
int wctob (wint_t);
EOF.
int mbsinit (const mbstate_t *);
mbstate_t object is in the initial state (the object is unaffected).
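One way to put a conversion state object into the initial state is to assign it from a static object that was never explicitly initialized. A minimal sketch, with reset_state and fresh_state_is_initial as hypothetical helper names:

```c
#include <assert.h>
#include <wchar.h>

/* Puts *st into the initial conversion state by copying an all-zero
   mbstate_t object with static lifetime. */
void reset_state(mbstate_t *st)
{
    static mbstate_t initial;       /* static lifetime: starts out all zeroes */

    assert(st != NULL);
    *st = initial;
}

/* Nonzero if a freshly reset state object is reported as initial. */
int fresh_state_is_initial(void)
{
    mbstate_t st;

    reset_state(&st);
    return mbsinit(&st) != 0;
}
```

An automatic mbstate_t object has an indeterminate value until it is reset in some such way, so it should not be passed to the conversion functions before that.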
size_t mbrlen (const char *s, size_t n, mbstate_t *pcs);
mbrtowc(NULL, s, n, pcs), except that it uses its own internal mbstate_t object, not that of mbrtowc, when given a null pointer.
size_t mbrtowc (wchar_t *ws, const char *s, size_t n, mbstate_t *pcs);
s (inspecting no more than n bytes) to a wide character. If ws is not a null pointer, the wide character is stored in *ws. If s is a null pointer, mbrtowc ignores ws and n and acts as if the first three arguments are a null pointer, an empty string, and 1 respectively.
(size_t)-2
mbstate_t, but no complete wide character has been found.
(size_t)-1
0
mbstate_t object has been restored to the initial state.
mbstate_t object has been updated.
mbstate_t object; the inspected bytes do not need to be passed to the function a second time.
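A typical decoding loop distinguishes the special return values before advancing. The sketch below counts the characters in a byte buffer; count_mb_chars and its negative return codes are assumptions of this example.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Counts the multibyte characters in the first len bytes of s, stopping
   early at a zero byte. Returns -1 on an encoding error and -2 if the
   buffer ends in the middle of a character. */
long count_mb_chars(const char *s, size_t len)
{
    mbstate_t st;
    long count = 0;

    assert(s != NULL);
    memset(&st, 0, sizeof st);      /* start in the initial state */
    while (len > 0) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, s, len, &st);

        if (r == (size_t)-1)
            return -1;              /* invalid sequence, errno is EILSEQ */
        if (r == (size_t)-2)
            return -2;              /* incomplete character at end of buffer */
        if (r == 0)
            break;                  /* converted the terminating zero */
        s += r;                     /* r bytes made up one character */
        len -= r;
        count++;
    }
    return count;
}
```

A caller reading input in chunks would handle the -2 case by keeping the state object and calling again with only the bytes that follow.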
size_t wcrtomb (char *, wchar_t, mbstate_t *);
MB_CUR_MAX bytes and places them in the array pointed to by the first argument; if the wide character is zero, the resulting sequence will end in the initial state, followed by a zero byte, and the mbstate_t object will be in the initial state. wcrtomb returns the number of bytes written to the character buffer, or (size_t)-1 to indicate an encoding error (errno is set to EILSEQ).
size_t mbsrtowcs (wchar_t *ws, const char **ps, size_t n, mbstate_t *pcs);
*ps to wide characters. The result is either (size_t)-1 if a conversion error occurs (in which case errno is set to EILSEQ), or else the number of wide characters converted, not counting any terminating zero wide character.
If ws is a null pointer, processing stops at the end of the string (the terminating zero byte is not counted in the returned value), and *pcs will be set to the initial state.
If ws is not a null pointer, the resulting wide character sequence is stored in the array it points to. Conversion stops when:
n wide characters have been stored; *pcs will be set to the conversion state after the last character converted, and *ps will point to the first unprocessed byte
the end of the string is reached; *pcs will be set to the initial state, *ps will be set to a null pointer, and a zero wide character will have been stored.
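The two ways conversion can stop are distinguishable through the updated source pointer. A sketch, in which to_wide is a hypothetical wrapper that treats "destination full" as a failure:

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Converts the multibyte string s into dst, which has room for cap wide
   characters including the terminating zero. Returns the count reported
   by mbsrtowcs on success, or (size_t)-1 on an encoding error or if the
   whole string, terminator included, did not fit. */
size_t to_wide(wchar_t *dst, size_t cap, const char *s)
{
    mbstate_t st;
    const char *p = s;
    size_t n;

    assert(dst != NULL && s != NULL && cap > 0);
    memset(&st, 0, sizeof st);      /* start in the initial state */
    n = mbsrtowcs(dst, &p, cap, &st);
    if (n == (size_t)-1)
        return n;                   /* encoding error, errno is EILSEQ */
    if (p != NULL)
        return (size_t)-1;          /* stopped early: dst is full, no terminator stored */
    return n;                       /* p is null: the terminator was stored */
}
```

Checking the pointer matters because a full destination is not reported as an error by mbsrtowcs itself, and the array is then left without a terminating zero.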
size_t wcsrtombs (char *s, const wchar_t **pws, size_t n, mbstate_t *pcs);
pws to a multibyte character sequence. The result is either (size_t)-1 if a conversion error occurs (in which case errno is set to EILSEQ), or else the number of bytes in the resulting multibyte string. Processing of the wide string stops either when a zero wide character - indicating the end of the wide string - is reached (the resulting multibyte string will end with a zero byte which is not included in the returned result), or (if s is not a null pointer) when it is not possible to process another wide character without placing more than n bytes into the array pointed to by s. In the first case, *pcs will be left in the initial state. If s is a null pointer, the value of n is ignored. Otherwise *pws will be set to either a null pointer (if conversion stopped on a zero wide character) or a pointer to the first unprocessed wide character. In the latter case, the returned value will be at least (n-MB_CUR_MAX+1).
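Because a null destination makes wcsrtombs merely count, a buffer can be sized in a first pass and filled in a second. The following is a sketch of that idiom; to_multibyte is a hypothetical helper, and the caller is assumed to free the result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Returns a newly allocated multibyte copy of ws, or a null pointer on
   an encoding error or if memory cannot be allocated. */
char *to_multibyte(const wchar_t *ws)
{
    mbstate_t st;
    const wchar_t *p;
    size_t need, got;
    char *out;

    assert(ws != NULL);

    /* First pass: null destination, so n is ignored and only the length
       in bytes (without the terminating zero byte) is computed. */
    memset(&st, 0, sizeof st);
    p = ws;
    need = wcsrtombs(NULL, &p, 0, &st);
    if (need == (size_t)-1)
        return NULL;

    out = malloc(need + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: convert for real, from a fresh initial state. */
    memset(&st, 0, sizeof st);
    p = ws;
    got = wcsrtombs(out, &p, need + 1, &st);
    if (got == (size_t)-1 || p != NULL) {
        free(out);
        return NULL;
    }
    return out;
}
```

After a successful second pass p is a null pointer, which confirms that conversion stopped on the zero wide character and that the terminating zero byte was written.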
<wctype.h>
reserves function names beginning with is or to followed by a lowercase letter. <wchar.h> reserves function names beginning with wcs followed by a lowercase letter. Lowercase letters are reserved as conversion specifiers for fwprintf and fwscanf.